The American Community Survey, run by the US Census Bureau, collects data on everything from housing affordability to employment rates in different industries. For this challenge, you'll use data derived from the American Community Survey for the years 2010-2012. The team at FiveThirtyEight has cleaned the dataset and made it available in their GitHub repo.
Here's a quick overview of the files we'll be working with:
all-ages.csv - employment data by major for all ages
recent-grads.csv - employment data by major for just recent college graduates
In [2]:
# %sh
# # download source file
# wget https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv
# wget https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv
# ls -l
In [3]:
import pandas as pd
# Load both datasets and inspect their columns and first few rows.
all_ages = pd.read_csv("all-ages.csv")
print(all_ages.columns)
print(all_ages.head(3))
recent_grads = pd.read_csv("recent-grads.csv")
print(recent_grads.columns)
print(recent_grads.head(3))
In [4]:
all_ages_major_categories = {}
recent_grads_major_categories = {}
def calculate_major_cat_totals(df):
    # Sum the Total column for each distinct Major_category value.
    counts_dictionary = {}
    for cat in df["Major_category"].value_counts().index:
        counts_dictionary[cat] = df["Total"][df["Major_category"] == cat].sum()
    return counts_dictionary
all_ages_major_categories = calculate_major_cat_totals(all_ages)
recent_grads_major_categories = calculate_major_cat_totals(recent_grads)
print(all_ages_major_categories)
print(recent_grads_major_categories)
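The per-category loop above can also be written with pandas' `groupby`. The following is a minimal sketch, assuming the same `Major_category` and `Total` columns. The `toy` frame and the helper name `major_cat_totals_groupby` are made up for illustration (the values are not survey data), so the block runs without the CSV files.

```python
import pandas as pd

# Hypothetical toy rows standing in for all_ages / recent_grads (not real survey values).
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Major_category": ["Engineering", "Engineering", "Arts", "Arts"],
    "Total": [100, 50, 30, 20],
})

def major_cat_totals_groupby(df):
    """Sum the Total column within each Major_category and return a plain dict."""
    return df.groupby("Major_category")["Total"].sum().to_dict()

toy_totals = major_cat_totals_groupby(toy)
print(toy_totals)
```

On the real data, the same helper could be called as `major_cat_totals_groupby(all_ages)` in place of `calculate_major_cat_totals(all_ages)`.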
The press likes to talk a lot about how many college grads are unable to get higher-wage, skilled jobs and end up working lower-wage, unskilled jobs instead. As a data person, it is your job to be skeptical of any broad claims and to analyze relevant data to obtain a more nuanced view. Let's run some basic calculations to explore that idea further.
In [5]:
# Share of all recent grads in low-wage jobs, as a fraction between 0 and 1 (not a percentage).
low_wage_percent = recent_grads["Low_wage_jobs"].astype(float).sum() / recent_grads["Total"].sum()
print(low_wage_percent)
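The ratio above is a pooled share: total low-wage jobs divided by total graduates, so large majors dominate the result. Averaging the per-major shares instead gives every major equal weight and can produce a noticeably different number. The sketch below illustrates the difference on made-up rows (the `toy` frame is hypothetical, not survey data), assuming the same `Low_wage_jobs` and `Total` column names.

```python
import pandas as pd

# Hypothetical toy rows (not real survey values): one large major, one small major.
toy = pd.DataFrame({
    "Major": ["A", "B"],
    "Total": [1000, 100],
    "Low_wage_jobs": [100, 50],
})

# Pooled share: every graduate counts equally, so the large major dominates.
pooled_share = toy["Low_wage_jobs"].sum() / toy["Total"].sum()

# Unweighted mean of per-major shares: every major counts equally.
mean_of_shares = (toy["Low_wage_jobs"] / toy["Total"]).mean()

print(pooled_share, mean_of_shares)
```

Which of the two is appropriate depends on the question: the pooled share describes a typical graduate, while the mean of shares describes a typical major.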
Both the all_ages and recent_grads datasets have 173 rows, corresponding to the 173 college major codes. This lets us compare the two datasets directly and run some initial calculations to see how the statistics for recent college graduates differ from those for the entire population.
In [6]:
# All majors, common to both DataFrames
majors = recent_grads['Major'].value_counts().index
# Despite the names, these count majors where each group has the LOWER unemployment rate.
recent_grads_lower_emp_count = 0
all_ages_lower_emp_count = 0
for major in majors:
    recent_unemp = recent_grads["Unemployment_rate"][recent_grads["Major"] == major].values[0]
    all_unemp = all_ages["Unemployment_rate"][all_ages["Major"] == major].values[0]
    if recent_unemp < all_unemp:
        recent_grads_lower_emp_count += 1
    elif recent_unemp > all_unemp:
        all_ages_lower_emp_count += 1
    # Ties are counted for neither group.
print("Recent grads fare better: {}".format(recent_grads_lower_emp_count))
print("All ages fare better: {}".format(all_ages_lower_emp_count))
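The major-by-major loop above can be sketched as a single merge on `Major` followed by vectorized comparisons. This is one possible alternative under stated assumptions, not the notebook's own method: the helper name `compare_unemployment` and the toy frames are hypothetical, and the rates are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frames (not real survey values).
toy_recent = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Unemployment_rate": [0.05, 0.08, 0.06],
})
toy_all = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Unemployment_rate": [0.06, 0.07, 0.06],
})

def compare_unemployment(recent, overall):
    """Count majors where each group has the strictly lower unemployment rate."""
    # Align the two frames on Major; suffixes distinguish the two rate columns.
    merged = recent.merge(overall, on="Major", suffixes=("_recent", "_all"))
    recent_rate = merged["Unemployment_rate_recent"]
    all_rate = merged["Unemployment_rate_all"]
    # Ties fall into neither count, matching the if/elif loop.
    return int((recent_rate < all_rate).sum()), int((recent_rate > all_rate).sum())

recent_lower, all_lower = compare_unemployment(toy_recent, toy_all)
print("Recent grads fare better: {}".format(recent_lower))
print("All ages fare better: {}".format(all_lower))
```

With the default inner join, majors present in only one frame are dropped silently, whereas the loop's `.values[0]` would raise an `IndexError` for a major missing from `all_ages`.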